duplicate detection
Critical appraisal of artificial intelligence for rare-event recognition: principles and pharmacovigilance case studies
Noren, G. Niklas, Meldau, Eva-Lisa, Ellenius, Johan
Many high-stakes AI applications target low-prevalence events, where apparent accuracy can conceal limited real-world value. Relevant AI models range from expert-defined rules and traditional machine learning to generative LLMs constrained for classification. We outline key considerations for critical appraisal of AI in rare-event recognition, including problem framing and test set design, prevalence-aware statistical evaluation, robustness assessment, and integration into human workflows. In addition, we propose an approach to structured case-level examination (SCLE), to complement statistical performance evaluation, and a comprehensive checklist to guide procurement or development of AI models for rare-event recognition. We instantiate the framework in pharmacovigilance, drawing on three studies: rule-based retrieval of pregnancy-related reports; duplicate detection combining machine learning with probabilistic record linkage; and automated redaction of person names using an LLM. We highlight pitfalls specific to the rare-event setting, including optimism from unrealistic class balance and lack of difficult positive controls in test sets, and show how cost-sensitive targets align model performance with operational value. While grounded in pharmacovigilance practice, the principles generalize to domains where positives are scarce and error costs may be asymmetric.
- Information Technology (0.93)
- Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (0.48)
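The point in the abstract above about apparent accuracy concealing limited value at low prevalence can be illustrated with a short worked example. This is a minimal sketch using Bayes' rule, not code from the paper; the function name and the 99% sensitivity and specificity figures are hypothetical values chosen for illustration.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (PPV) implied by sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 99% sensitivity and 99% specificity (illustrative values only).
balanced = positive_predictive_value(0.99, 0.99, prevalence=0.5)
rare = positive_predictive_value(0.99, 0.99, prevalence=0.001)

print(f"PPV on a balanced test set (50% positives): {balanced:.3f}")
print(f"PPV at 0.1% prevalence:                     {rare:.3f}")
```

With the same sensitivity and specificity, precision falls from 0.99 on a balanced test set to roughly 0.09 at 0.1% prevalence, which is one way an unrealistic class balance can produce optimistic results.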
MHSNet: An MoE-based Hierarchical Semantic Representation Network for Accurate Duplicate Resume Detection with Large Language Model
Li, Yu, Chen, Zulong, Xu, Wenjian, Wen, Hong, Yu, Yipeng, Yiu, Man Lung, Yin, Yuyu
To maintain the company's talent pool, recruiters need to continuously search for resumes from third-party websites (e.g., LinkedIn, Indeed). However, fetched resumes are often incomplete and inaccurate. To improve the quality of third-party resumes and enrich the company's talent pool, it is essential to conduct duplication detection between the fetched resumes and those already in the company's talent pool. Such duplication detection is challenging due to the semantic complexity, structural heterogeneity, and information incompleteness of resume texts. To this end, we propose MHSNet, a multi-level identity verification framework that fine-tunes BGE-M3 using contrastive learning. With the fine-tuned model, Mixture-of-Experts (MoE) generates multi-level sparse and dense representations for resumes, enabling the computation of corresponding multi-level semantic similarities. Moreover, the state-aware Mixture-of-Experts (MoE) is employed in MHSNet to handle diverse incomplete resumes. Experimental results verify the effectiveness of MHSNet.
- Asia > South Korea > Seoul > Seoul (0.06)
- Asia > China > Zhejiang Province > Hangzhou (0.05)
- Asia > China > Hong Kong (0.05)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
On Parallel External-Memory Bidirectional Search
Siag, Lior, Shperberg, Shahaf S., Felner, Ariel, Sturtevant, Nathan R.
Parallelization and External Memory (PEM) techniques have significantly enhanced the capabilities of search algorithms when solving large-scale problems. Previous research on PEM has primarily centered on unidirectional algorithms, with only one publication on bidirectional PEM that focuses on the meet-in-the-middle (MM) algorithm. Building upon this foundation, this paper presents a framework that integrates both unidirectional and bidirectional best-first search algorithms. We then develop a PEM variant of the state-of-the-art bidirectional heuristic search (BiHS) algorithm BAE* (PEM-BAE*). As previous work on BiHS did not focus on scaling problem sizes, this work enables us to evaluate bidirectional algorithms on hard problems. Empirical evaluation shows that PEM-BAE* outperforms the PEM variants of A* and the MM algorithm, as well as a parallel variant of IDA*. These findings mark a significant milestone, revealing that bidirectional search algorithms clearly outperform unidirectional search algorithms across several domains, even when equipped with state-of-the-art heuristics.
- North America > Canada > Alberta (0.14)
- Asia > Vietnam > Hanoi > Hanoi (0.05)
- Asia > Middle East > Israel (0.04)
Parallel Strategies for Best-First Generalized Planning
Fernández-Alburquerque, Alejandro, Segovia-Aguas, Javier
In recent years, there has been renewed interest in closing the performance gap between state-of-the-art planning solvers and generalized planning (GP), a research area of AI that studies the automated synthesis of algorithmic-like solutions capable of solving multiple classical planning instances. One of the current advancements has been the introduction of Best-First Generalized Planning (BFGP), a GP algorithm based on a novel solution space that can be explored with heuristic search, one of the foundations of modern planners. This paper evaluates the application of parallel search techniques to BFGP, another critical component in closing the performance gap. We first discuss why BFGP is well suited for parallelization and some of its differentiating characteristics from classical planners. Then, we propose two simple shared-memory parallel strategies with good scaling with the number of cores.
Duplicate Detection with GenAI
Customer data is often stored as records in Customer Relationship Management systems (CRMs). Data that is manually entered into such systems by one or more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there is no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasingly complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how using the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find an improvement in the accuracy of data de-duplication rates from 30 percent using NLP techniques to almost 60 percent using our proposed method.
Combining Embeddings and Domain Knowledge for Job Posting Duplicate Detection
Engelbach, Matthias, Klau, Dennis, Kintz, Maximilien, Ulrich, Alexander
Job descriptions are posted on many online channels, including company websites, job boards or social media platforms. These descriptions are usually published with varying text for the same job, due to the requirements of each platform or to target different audiences. However, for the purpose of automated recruitment and assistance of people working with these texts, it is helpful to aggregate job postings across platforms and thus detect duplicate descriptions that refer to the same job. In this work, we propose an approach for detecting duplicates in job descriptions. We show that combining overlap-based character similarity with text embedding and keyword matching methods leads to convincing results. In particular, we show that although no approach individually achieves satisfying performance, a combination of string comparison, deep textual embeddings, and the use of curated weighted lookup lists for specific skills leads to a significant boost in overall performance. A tool based on our approach is being used in production and feedback from real-life use confirms our evaluation.
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Arkansas > Pulaski County > Little Rock (0.04)
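The combination described in the job posting abstract above (string comparison, text embeddings, and weighted keyword lookup lists) can be sketched as a weighted sum of three similarity scores. This is an illustrative sketch, not the authors' implementation: the function names, the mixing weights, and the skill list are hypothetical, and a bag-of-words cosine stands in for deep text embeddings so that the example runs with the standard library only.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokens(text: str) -> list[str]:
    """Lowercase word tokens (keeps '+' and '#' so terms like 'c++' survive)."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def char_similarity(a: str, b: str) -> float:
    """Overlap-based character similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of token counts; a stand-in for embedding similarity."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_similarity(a: str, b: str, weights: dict[str, float]) -> float:
    """Weighted Jaccard overlap of lookup-list keywords found in each text."""
    ka = {t for t in tokens(a) if t in weights}
    kb = {t for t in tokens(b) if t in weights}
    union = sum(weights[k] for k in ka | kb)
    return sum(weights[k] for k in ka & kb) / union if union else 0.0


def duplicate_score(a: str, b: str, weights: dict[str, float],
                    mix: tuple[float, float, float] = (0.3, 0.4, 0.3)) -> float:
    """Weighted combination of the three similarities; `mix` is an assumed setting."""
    parts = (char_similarity(a, b), bow_cosine(a, b), keyword_similarity(a, b, weights))
    return sum(w * p for w, p in zip(mix, parts))


# Hypothetical curated skill list with weights.
SKILLS = {"python": 2.0, "sql": 1.0, "java": 2.0}

job_a = "Senior Python developer with SQL experience"
job_b = "Python developer (senior), SQL experience required"
job_c = "Java backend engineer for payment systems"

print(f"a vs b: {duplicate_score(job_a, job_b, SKILLS):.2f}")
print(f"a vs c: {duplicate_score(job_a, job_c, SKILLS):.2f}")
```

In practice the mixing weights and the decision threshold would need to be tuned on labeled pairs, and the bag-of-words component would be replaced by a real embedding model.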
MultiSiam: A Multiple Input Siamese Network For Social Media Text Classification And Duplicate Text Detection
Bhoi, Sudhanshu, Markhedkar, Swapnil, Phadke, Shruti, Agrawal, Prashant
Social media accounts post increasingly similar content, creating a chaotic experience across platforms, which makes accessing desired information difficult. These posts can be organized by categorizing and grouping duplicates across social handles and accounts. There can be more than one duplicate of a post; however, a conventional Siamese neural network only considers a pair of inputs for duplicate text detection. In this paper, we first propose a multiple-input Siamese network, MultiSiam. This condensed network is then used to propose another model, SMCD (Social Media Classification and Duplication Model), to perform both duplicate text grouping and categorization. The MultiSiam network, just like the Siamese network, can be used in multiple applications by changing the sub-network appropriately.
Applying Machine Learning for Duplicate Detection, Throttling and Prioritization of Equipment Commissioning Audits at Fulfillment Network
Halawa, Farouq, Abdul, Majid, Mohammed, Raashid
VQ (Vendor Qualification) and IOQ (Installation and Operation Qualification) audits are implemented in warehouses to ensure all equipment being turned over in the fulfillment network meets the quality standards. Audit checks are likely to be skipped if there are many checks to be performed in a short time. In addition, exploratory data analysis reveals several instances of similar checks being performed on the same assets, thus duplicating the effort. In this work, Natural Language Processing and Machine Learning are applied to trim a large checklist dataset for a network of warehouses by identifying similarities and duplicates, and by predicting the non-critical checks with a high passing rate. The study proposes ML classifiers to identify checks that have a high probability of passing IOQ and VQ and to assign priorities to checks when time is not available to perform all of them. This research proposes using an NLP-based BlazingText classifier to throttle the checklists with a high passing rate, which can reduce the number of checks by 10%-37% and achieve significant cost reduction. The applied algorithm outperforms Random Forest and Neural Network classifiers and achieves an area under the curve of 90%. Because of imbalanced data, down-sampling and upweighting have shown a positive impact on the models' accuracy as measured by F1 score, which improves from 8% to 75%. In addition, the proposed duplicate detection process identifies 17% of checks as possibly redundant and eligible for trimming.
Duplicate Detection as a Service
Opdenplatz, Juliette, Şimşek, Umutcan, Fensel, Dieter
Completeness of a knowledge graph is an important quality dimension and a factor in how well an application that makes use of it performs. Completeness can be improved by performing knowledge enrichment. Duplicate detection aims to find identity links between the instances of knowledge graphs and is a fundamental subtask of knowledge enrichment. Current solutions to the problem require expert knowledge of the tool and the knowledge graph they are applied to. Users might not have this expert knowledge. We present our service-based approach to the duplicate detection task that provides an easy-to-use no-code solution that is still competitive with the state-of-the-art and has recently been adopted in an industrial context. The evaluation will be based on several frequently used test scenarios.
- Europe > Austria > Tyrol > Innsbruck (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > France > Occitanie > Hérault > Montpellier (0.04)
- Information Technology > Information Management (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
A*+BFHS: A Hybrid Heuristic Search Algorithm
Bu, Zhaoxing, Korf, Richard E.
We present a new algorithm A*+BFHS for solving hard problems where A* and IDA* fail due to memory limitations and/or the existence of many short cycles. A*+BFHS is based on A* and breadth-first heuristic search (BFHS). A*+BFHS combines advantages from both algorithms, namely A*'s node ordering, BFHS's memory savings, and both algorithms' duplicate detection. On easy problems, A*+BFHS behaves the same as A*. On hard problems, it is slower than A* but saves a large amount of memory. Compared to BFIDA*, A*+BFHS reduces the search time and/or memory requirement by several times on a variety of planning domains.
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
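In the heuristic search entries above, duplicate detection means recognizing that a generated state has already been reached, so that it is not expanded again unless a cheaper path to it is found. The sketch below shows this in plain A* with a table of best known path costs. It is not the A*+BFHS or PEM-BAE* algorithm; the grid domain, function names, and Manhattan heuristic are assumptions chosen for illustration.

```python
import heapq


def astar(start, goal, neighbors, heuristic):
    """A* with duplicate detection: a state is re-queued only via a cheaper path."""
    best_g = {start: 0}                       # cheapest known cost to each seen state
    frontier = [(heuristic(start), 0, start)]
    expansions = 0
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if g > best_g[state]:
            continue                          # stale entry: a cheaper path was found later
        if state == goal:
            return g, expansions
        expansions += 1
        for nxt, cost in neighbors(state):
            new_g = g + cost
            # Duplicate detection: skip states already reached at equal or lower cost.
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, nxt))
    return None, expansions


# Hypothetical domain: 4-connected 5x5 grid with unit step costs.
N = 5
GOAL = (N - 1, N - 1)


def grid_neighbors(state):
    x, y = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < N and 0 <= ny < N:
            yield (nx, ny), 1


def manhattan(state):
    return abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1])


cost, expansions = astar((0, 0), GOAL, grid_neighbors, manhattan)
print(f"path cost: {cost}, states expanded: {expansions}")
```

On a grid, many different move sequences lead to the same cell, which is the kind of short-cycle structure where search without duplicate detection would regenerate the same states many times.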